@Frando
Member

@Frando Frando commented Nov 11, 2025

Description

Fixes #3642

This moves discovery handling fully into the EndpointStateActor.
The pub(crate) interface to trigger discovery and get an EndpointMappedAddr is now Magicsock::resolve_remote, which sends the provided addresses to the EndpointStateActor. The actor starts discovery if it does not have a selected path and discovery is not already running. It replies immediately if there are any known paths; otherwise it waits for discovery to produce at least one result or an error. Once the actor replies, resolve_remote returns either an EndpointMappedAddr or the discovery error.

This means the current behavior is kept: We only start quinn::Endpoint::connect once we have at least one transport address for the remote. If not, we return the discovery error immediately from iroh::Endpoint::connect.
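
The decision the actor makes for each resolve request can be sketched roughly as follows. This is a simplified, synchronous model and not the code in this PR: `ResolveDecision`, `ActorState`, its fields, and `on_resolve_remote` are hypothetical names chosen for illustration.

```rust
use std::collections::HashSet;
use std::net::SocketAddr;

/// Outcome of handling one resolve request (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
struct ResolveDecision {
    /// Start a new discovery run.
    start_discovery: bool,
    /// Reply to the caller right away instead of waiting for discovery.
    reply_now: bool,
}

/// Minimal stand-in for the actor's state (hypothetical fields).
#[derive(Default)]
struct ActorState {
    known_paths: HashSet<SocketAddr>,
    selected_path: Option<SocketAddr>,
    discovery_running: bool,
}

impl ActorState {
    /// Records the caller-provided addresses and decides what to do next.
    fn on_resolve_remote(&mut self, provided: &[SocketAddr]) -> ResolveDecision {
        self.known_paths.extend(provided.iter().copied());

        // Discovery starts only without a selected path, and only if idle.
        let start_discovery = self.selected_path.is_none() && !self.discovery_running;
        if start_discovery {
            self.discovery_running = true;
        }

        // Any known path is enough to reply immediately.
        let reply_now = !self.known_paths.is_empty();
        ResolveDecision {
            start_discovery,
            reply_now,
        }
    }
}
```

In this model a request with no known addresses starts discovery and waits, while a request that already carries an address gets an immediate reply even though discovery is also started.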

This opens the door for us to easily tune when to run discovery in other situations, e.g. when all available paths to a remote are closed. However, for now this PR still only starts discovery when Endpoint::connect is called and no path is currently selected.

Breaking Changes

Notes & open questions

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.
  • All breaking changes documented.
    • List all breaking changes in the above "Breaking Changes" section.
    • Open an issue or PR on any number0 repos that are affected by this breaking change. Give guidance on how the updates should be handled or do the actual updates themselves. The major ones are:

@github-actions

github-actions bot commented Nov 11, 2025

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/3645/docs/iroh/

Last updated: 2025-11-18T15:06:00Z

@n0bot n0bot bot added this to iroh Nov 11, 2025
@github-project-automation github-project-automation bot moved this to 🏗 In progress in iroh Nov 11, 2025
Contributor

@flub flub left a comment

I guess most tests need to be moved to the StaticProvider?

I don't quite follow the WantConnect stuff. Is this trying to make sure you can return an error straight from calling Endpoint::connect when you don't have any discovery results? As otherwise you'd start connecting and then the connection would time out?

Comment on lines 302 to 303
// Prune our own addreses from the endpoint address.
// TODO: Move this somewhere else?
Contributor

Yes please!

The EndpointStateActor should collect these addresses as usual but when choosing addresses to send to when handling SendDatagram it should filter out any that are itself. That would mean the possible remote addresses are not lost when you move network. The EndpointStateActor already has access to the local DirectAddrs so this is way more logical.
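
A minimal sketch of filtering at send time, assuming addresses are plain `SocketAddr` values. The function name `sendable_addrs` and its signature are hypothetical and do not come from the PR.

```rust
use std::collections::HashSet;
use std::net::SocketAddr;

/// Returns the candidate remote addresses that are not one of our own
/// current local addresses. The stored candidates are left untouched, so
/// nothing is lost if the local address set changes later.
fn sendable_addrs(
    candidates: &[SocketAddr],
    local_addrs: &HashSet<SocketAddr>,
) -> Vec<SocketAddr> {
    candidates
        .iter()
        .copied()
        .filter(|addr| !local_addrs.contains(addr))
        .collect()
}
```

Because the check runs against whatever `local_addrs` contains at the time of the call, an address that was filtered out on one network becomes usable again once it is no longer local.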

@Frando
Member Author

Frando commented Nov 11, 2025

I don't quite follow the WantConnect stuff. Is this trying to make sure you can return an error straight from calling Endpoint::connect when you don't have any discovery results?

Yes, exactly. It mimics the current behavior that quinn::Endpoint::connect is only started once we have at least one dialable remote address, and if discovery errors or returns no results, we return straight from iroh::Endpoint::connect without touching quinn at all. The alternative would be that all connections time out if discovery yields nothing.

Edit: I renamed it to ResolveRemote which is much clearer IMO.

@Frando Frando force-pushed the Frando/mp-discovery branch from 6189b7f to 196d89a Compare November 11, 2025 16:22
@github-actions

github-actions bot commented Nov 12, 2025

Netsim report & logs for this PR have been generated and are available at: LOGS
This report will remain available for 3 days.

Last updated for commit: 889dfc0

@Frando Frando marked this pull request as ready for review November 12, 2025 10:04
@Frando Frando requested a review from flub November 12, 2025 10:04
@flub flub linked an issue Nov 13, 2025 that may be closed by this pull request
Member

@matheus23 matheus23 left a comment

I'm a little confused about the local_addrs filtering. Is there some situation in particular that you saw that meant you had to add filtering?
Ideally none of our discovery services will accidentally discover local addresses.
And perhaps ideally the endpoint doesn't break too much when these addresses do end up in our EndpointStateActor::paths, right?

@Frando
Member Author

Frando commented Nov 17, 2025

I'm a little confused about the local_addrs filtering. Is there some situation in particular that you saw that meant you had to add filtering?
Ideally none of our discovery services will accidentally discover local addresses.
And perhaps ideally the endpoint doesn't break too much when these addresses do end up in our EndpointStateActor::paths, right?

I did not add it, only moved it around. It lives here on feat-multipath. Because that whole function was removed, I moved the check over, as suggested by @flub.

@matheus23
Member

Given that check used to happen in add_endpoint_addr, should the check perhaps be moved to the equivalent of that at EndpointStateAddr::add_addrs instead of happening every time we send_datagram?

@Frando
Member Author

Frando commented Nov 17, 2025

Given that check used to happen in add_endpoint_addr, should the check perhaps be moved to the equivalent of that at EndpointStateAddr::add_addrs instead of happening every time we send_datagram?

This sounds good to me, but I'd like to hear @flub's opinion too.

Edit: I remember now why I moved it to handle_msg_send_datagram. The local addrs might have changed between the call to add_addr and when these addrs are used in handle_msg_send_datagram, so I thought it would be best to check against the current set of local addrs when trying to use an address, instead of filtering when adding them (against a potentially ever-changing set of local addrs).


# test_utils
axum = { version = "0.8", optional = true }
sync_wrapper = { version = "1.0.2", features = ["futures"] }
Contributor

😢 another dependency just for making things sync?

Member Author

another? which one is the other one?

Member Author

@Frando Frando Nov 17, 2025

We could add it to n0-futures though. The whole crate is <300 lines.
https://docs.rs/sync_wrapper/latest/src/sync_wrapper/lib.rs.html

Contributor

another, to the many dependencies we already have 😓

@matheus23
Member

Edit: I remember now why I moved it to handle_msg_send_datagram. The local addrs might have changed between the call to add_addr and when these addrs are used in handle_msg_send_datagram, so I thought it would be best to check against the current set of local addrs when trying to use an address, instead of filtering when adding them (against a potentially ever-changing set of local addrs).

My suggestion would be to update EndpointStateActor::paths right after the self.local_addrs.updated() hits in the run loop.

It's a little bit more bookkeeping, but avoids filtering in the send_datagram fn :)
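
A rough sketch of that bookkeeping, assuming the paths can be modelled as a set of `SocketAddr` values. `prune_local_paths` is a hypothetical helper, not a function from the PR.

```rust
use std::collections::HashSet;
use std::net::SocketAddr;

/// Removes every stored remote path that matches a current local address.
/// Intended to run once per local-address update rather than on every send.
/// Returns how many entries were dropped.
fn prune_local_paths(
    paths: &mut HashSet<SocketAddr>,
    local_addrs: &HashSet<SocketAddr>,
) -> usize {
    let before = paths.len();
    paths.retain(|addr| !local_addrs.contains(addr));
    before - paths.len()
}
```

The trade-off raised earlier in this thread still applies: a pruned entry is gone, so it does not come back if the local address set changes again, whereas filtering at send time keeps it around.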

.or_default()
.sources
.insert(Source::Connection { _0: Private }, Instant::now());
self.add_path_entry(path_remote, Source::Connection { _0: Private });
Contributor

I can't say I'm a fan of having a helper function for this. It hides what state we're modifying. A self.paths.add_entry() would be much more amenable. I kind of think we should modify local state directly, otherwise we're building something too complex and need to tweak the abstractions.

As a similar example I had several iterations of pretty horrible state management issues before I ended up finding ConnectionState in its current form, which ended up being much more maintainable (and may need to evolve further at some point, this is by no means some perfect abstraction).

Contributor

Ok, I've made it all the way to the end of the endpoint_state.rs changes and finally have managed to come up with a suggestion. You might encounter a few more comments on this topic. Not 100% sure it's the right one as I haven't tried but anyway.

I think the complexity I have an issue with comes from the fact that you need to respond to the ResolveRemote requests at two points:

  • whenever self.paths changes
  • whenever discovery found something

And then you end up with a brittle combination of mutating state and calling hooks at the right points all over the place.

Can this not be moved into self.paths? Maybe path_state.rs can redeem itself for still being its own module 😉 (currently a historical accident). What if we had an EndpointPathState (bikeshedding welcome) that you passed the oneshot senders into? That way you'd have a clear abstraction boundary for paths being added, and it could emit change responses whenever we get a new remote transport address for whatever reason.

Would that go a long way to reducing the brittleness and complexity that's being added here?
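
To illustrate the shape of this suggestion, here is a minimal sketch. It is not the implementation that landed: `PathStateSketch` and its methods are hypothetical, `std::sync::mpsc` channels stand in for oneshot senders, and a `String` stands in for the discovery error.

```rust
use std::collections::HashSet;
use std::net::SocketAddr;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Reply sent to a pending resolve request (error type is a stand-in).
type ResolveResult = Result<(), String>;

/// Owns both the known paths and the pending resolve requests, so every
/// state change that matters to a waiter goes through one place.
#[derive(Default)]
struct PathStateSketch {
    paths: HashSet<SocketAddr>,
    pending: Vec<Sender<ResolveResult>>,
}

impl PathStateSketch {
    /// Registers a resolve request; replies at once if a path is known.
    fn resolve(&mut self) -> Receiver<ResolveResult> {
        let (tx, rx) = channel();
        if self.paths.is_empty() {
            self.pending.push(tx);
        } else {
            let _ = tx.send(Ok(()));
        }
        rx
    }

    /// Adds a path and notifies all waiters if it is new.
    fn add_path(&mut self, addr: SocketAddr) {
        if self.paths.insert(addr) {
            self.flush(Ok(()));
        }
    }

    /// Notifies waiters that discovery ended, with an error if nothing
    /// usable was found.
    fn discovery_finished(&mut self, result: Result<(), String>) {
        let reply = match result {
            Err(err) => Err(err),
            Ok(()) if self.paths.is_empty() => {
                Err("discovery produced no addresses".to_string())
            }
            Ok(()) => Ok(()),
        };
        self.flush(reply);
    }

    /// Sends `reply` to every pending request and clears the list.
    fn flush(&mut self, reply: ResolveResult) {
        for tx in self.pending.drain(..) {
            // A dropped receiver just means the caller gave up waiting.
            let _ = tx.send(reply.clone());
        }
    }
}
```

With this structure the caller never has to remember to emit pending requests after mutating the paths, because `add_path` and `discovery_finished` are the only ways to change what a waiter cares about.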

Member Author

@Frando Frando Nov 17, 2025

This would indeed likely be a better structure, yes.

I implemented this, see the latest commit on this branch.

.or_default()
.sources
.insert(source.clone(), Instant::now());
/// Adds new [`TransportAddr`] addresses to our list of potential paths.
Contributor

Please also add what this really does. Because appending to a list does not require a function. What magic happens?

name: item.provenance().to_string(),
};
let addr = item.into_endpoint_addr();
self.add_addrs(addr.addrs, source);
Contributor

not calling self.emit_pending_resolve_requests here is a red flag for state management to me. Because the other branches have this. It makes things complex.

Member Author

This now goes over the EndpointPathState, lmk if you still think the interactions are not clear enough.


/// Error returned when the endpoint state actor stopped while waiting for a reply.
#[stack_error(derive)]
#[stack_error(add_meta, derive)]
Contributor

ooh! TIL why things were the shape they were before.

@Frando Frando force-pushed the Frando/mp-discovery branch from bbb3d27 to b9aceab Compare November 17, 2025 13:40
@Frando Frando force-pushed the Frando/mp-discovery branch from 162c7c2 to 147d1a0 Compare November 17, 2025 16:01
@Frando Frando force-pushed the Frando/mp-discovery branch from b02df8a to 15d4cf4 Compare November 17, 2025 16:15
@Frando Frando force-pushed the Frando/mp-discovery branch from e8178b4 to ed14a71 Compare November 18, 2025 10:23
/// Notifies that a discovery run has finished.
///
/// This will emit pending resolve requests.
pub(super) fn discovery_finished(&mut self, error: Option<DiscoveryError>) {
Contributor

minor nit, but doesn't res: Result<(), DiscoveryError> make more sense?
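
For comparison, a small sketch of the two shapes. `DiscoveryError` here is a hypothetical stand-in for the real error type, and both helper functions are illustrative only.

```rust
/// Hypothetical stand-in for the real discovery error type.
#[derive(Debug, Clone, PartialEq, Eq)]
struct DiscoveryError(String);

/// Converts the `Option` form into the `Result` form suggested above.
fn into_result(error: Option<DiscoveryError>) -> Result<(), DiscoveryError> {
    match error {
        Some(err) => Err(err),
        None => Ok(()),
    }
}

/// With a `Result`, success and failure are named explicitly at the
/// call site instead of being encoded as `None` and `Some`.
fn summarize(res: &Result<(), DiscoveryError>) -> String {
    match res {
        Ok(()) => "discovery finished".to_string(),
        Err(DiscoveryError(msg)) => format!("discovery failed: {}", msg),
    }
}
```

The two forms carry the same information; the `Result` version mainly reads more naturally and works with the `?` operator.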

scheduled_open_path: None,
pending_open_paths: VecDeque::new(),
sender,
discovery,
Contributor

So definitely nitpicking. I always liked: https://rust-lang.github.io/rust-clippy/stable/index.html#inconsistent_struct_constructor

But as the lint is not enabled I'm sure you get to choose what you do :)

@Frando Frando merged commit 01545ee into feat-multipath Nov 18, 2025
26 of 28 checks passed
@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in iroh Nov 18, 2025
Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

Improve discovery cooperation with EndpointStateActor

5 participants